PCA is a tool to represent high dimensional data in a lower dimension with as much information as possible. PCA first finds the dimension that has the most variance which becomes the first principle component (PC), afterwards the orthogonal dimension with the most variance becomes the second PC and so on. Overall in a D-dimensional data set, there are D PC’s, but with most packages, only the first 50 are given by default.
Disclaimer: PCA assumes that variables are centered to have zero mean. It is also good idea for the variables to have standard deviation 1 before performing PCA.
So the first Principle Component is found by: \[ \begin{aligned} \max_{\phi_{11},...,\phi_{p1}} &\left \{ \frac{1}{n} \sum_{i=1}^n \left( \sum_{j=1}^p \phi_{j1} x_{ij} \right) \right\} \\ \text{subject to:} \quad &\sum_{j=1}^p \phi_{j1}^2 = 1 \end{aligned} \] where \(n\) is the number of observations, \(p\) is the number of features (dimensions), \(\phi_{jm}\) is element \(j\) of the loading vector for PC \(m\), and \(x_{ij}\) is feature (here gene) \(j\) of observation \(i\).
Each PC explains a proportion of the full variance in the data set, with the first PC explaining the most. The variance explained by the \(m\)th principal component is: \[ PVE_m = \frac{1}{n} \sum_{i=1}^n z_{im}^2 = \frac{1}{n} \sum_{i=1}^n \left( \sum_{j=1}^p \phi_{jm} x_{ij} \right)^2 \] Visualizing these variance explained values in a plot will give a scree plot. A cumulative scree plot is great for visualizing how many PCs should be used for the wished proportion of variance explained.
PCA is only a good dimension-reduction method for visualization, when the first 2 PC’s account for most of the variation in the data.
library(tidyverse)
library(anndata)
library(Seurat)
library(pdfCluster)
# setting working directory and loading the pilot set
setwd("~/sc_covid_PiB2023/data_science_project/Code")
Pilot <- read_h5ad('../Data/Pilot_2_rule_them_ALL.h5ad')
# for more information on data see Clean_data_subsetting.Rmd or Clean_summary_statistics.Rmd
We’ve chosen to use RunPCA from Seurat. We
need the expression matrix as \(cell \times
genes\)
dim(Pilot$X)
## [1] 5694 23382
If you see Clean_Data_subsetting.Rmd you can see this is
the case with 5694 rows of cells and 23382 coloumns of genes.
PCA is then run on the data:
PCA <- RunPCA(Pilot$X, assay = NULL, npcs = 50, rev.pca = F, weight.by.var = T, verbose = F)
# Nice to know:
# npcs = number of PCs (we use 50 because it was the default),
# rev.pca = do we have to do reverse pca? i.e. is the expression matrix genes x cells?
# weight.by.var indicats whether we standerdise variance for the variables
Now that we’ve calculated the first 50 PCs we’ll save it as metadata for the observations (for future use):
Pilot$obsm$X_pca <- PCA[]
write_h5ad(Pilot, '../Data/Pilot_2_rule_them_ALL.h5ad')
Pilot <- read_h5ad('../Data/Pilot_2_rule_them_ALL.h5ad')
Before we can plot this we need to convert it to a dataframe such that it is compatible with ggplot:
df <- as.data.frame(PCA[]) # this could also be done with "Pilot$obsm$X_pca" for same result (other than column-names)
# appending all annotations:
df <- cbind(df,Pilot$obs)
# See 51:58 for annotations
#head(df[,1:58])
We can look a scree plots to find out how much of the variance in the data is explained by each PC:
# PCA@stdev is standard diviation for each PC
head(PCA@stdev) # drops in value as expected (we see PC_1 has a LOT of the sd)
## [1] 19.084890 6.237265 4.686928 3.435942 3.031849 2.642807
var <- PCA@stdev^2 # variance of each PC
total_var <- sum(var) # total variance of all PC
var_explained <- var/total_var # amount explained by each PC of the total
cum_var <- cumsum(var_explained) # amount explained cumulatively
# for visualization:
scree <- cbind(PC = c(1:50), var, var_explained, cum_var)
scree <- as.data.frame(scree)
Scree plot of varience explained per PC:
npcs = 10 # number of PCs to include in scree plot
qplot(c(1:npcs), var_explained[1:npcs]) +
geom_line() +
xlab("Principal Component") +
ylab("Variance Explained") +
ggtitle("Scree Plot (contibution pr PC)")
Scree plot of amount of varience explained cumulatively by adding each PC:
npcs = 10 # number of PCs to include in scree plot
qplot(c(1:npcs), cum_var[1:npcs]) +
geom_line() +
xlab("Principal Component") +
ylab("Variance Explained") +
ggtitle("Scree Plot (cumulatively)")
Since the first principle component explains a majority of the varience it might be interesting to look at what genes are the most responsible for driving this:
gene_loadings <- abs(as.array(PCA@cell.embeddings[,1]))
largest_indices <- order(gene_loadings, decreasing = T)[1:5]
gene_loadings[largest_indices]
## MALAT1 B2M MT-CO2 FTL MT-CO1
## 341.9402 330.1053 315.6178 305.2956 294.7431
Are five genes most responsible for driveing the first PC. B2M (Beta-2-Microglobulin) has been associated with SARS-CoV-2 and other infectious diseases.
In this section we will try to visualise how well the first 2 PCs capture the different annotations:
ggplot(df, aes(x = PC_1, y = PC_2, color = cellType)) +
geom_point(alpha = 0.8) +
ggtitle("PCA colored by cellType") +
xlab("Principal Component 1") +
ylab("Principal Component 2")
We’ll also test how well k-means clustering can identify some of the annotations. For this we will however allow for multiple PCs to be used to give the best ARI score:
# initializing:
k = length(unique(Pilot$obs$cellType)) # number of celltypes as k
max_ARI = -1
nr_PCs = 0
for (i in 2:50){ # wwhere i is nr of included PCs to cluster on
cluster <- kmeans(PCA[,1:i], iter = 10000, nstart = 100, k, algorithm="MacQueen") # k-means
#(we use MacQueen since it is more computationally viable and runs into fewer errors)
ARI = adj.rand.index(cluster$cluster, Pilot$obs$cellType) # ARI score
if (ARI > max_ARI){ # if we find a top score we note it
max_ARI <- ARI
nr_PCs <- i
df$cluster <- factor(cluster$cluster)
}
}
# we then plot the best ARI and the corresponding clustering:
ggplot(df, aes(x=PC_1, y=PC_2, color=cluster)) +
geom_point() +
labs(title = "Kmeans on PCA", subtitle = "Clustered by celltype",
caption = paste("K =", k, " | PCS used =", nr_PCs, " | ARI =", round(max_ARI, 4),collapse=' '))
We do the same for hierarchical clustering:
k = length(unique(Pilot$obs$cellType)) # number of clusters
max_ARI = -1
nr_PCs = 0
for (i in c(2,3,4,6,10,15,20,30,50)){ # if we runn through all PCs it takes to long, so I've chosen some (The best result is often within the first few PCs)
for (m in c("complete", "single", "average", "centroid")){ # here we try all the different kinds of linkage
dm <- dist(as.data.frame(PCA[,1:i])) # Distance matrix
hc <- hclust(dm, method = m) # hierarchical clustering resulting in dendrogram
clusterCutS <- cutree(hc, k) # Clustering result based on (data, number of clusters)
ARI = adj.rand.index(clusterCutS, Pilot$obs$cellType)
if (max_ARI< ARI) {
best_linkage = m
max_ARI = ARI
best_hc = hc
nr_PCs = i
df$cluster <- factor(clusterCutS)
}
}
}
# we plot the best ARI
ggplot(df, aes(x=PC_1, y=PC_2, color=cluster)) +
geom_point() +
labs(title = "Hierarchical on PCA", subtitle = "Clustered by celltype",
caption = paste("K =", k, " | PCS used =", nr_PCs,
" | ARI =", round(max_ARI, 4), " | linkage =", m,collapse=' '))
Repeat of the above but with infection status instead:
df %>%
arrange(viralLoad) %>%
ggplot(aes(x = PC_1, y = PC_2, color = Is_infected)) +
geom_point(alpha = 0.3) +
scale_color_manual(values = c("0" = "blue", "1" = "red")) +
ggtitle("PCA colored by infection status") +
xlab("Principal Component 1") +
ylab("Principal Component 2")
# for more indepth comments see ealier chunk "k-means"
k = 2 # infected or not as k ie k=2
max_ARI = -1
nr_PCs = 0
for (i in 2:50){
cluster <- kmeans(PCA[,1:i], iter = 10000, nstart = 100, k, algorithm="MacQueen")
ARI = adj.rand.index(cluster$cluster, Pilot$obs$Is_infected)
if (ARI > max_ARI){
max_ARI <- ARI
nr_PCs <- i
df$cluster <- factor(cluster$cluster)
}
}
ggplot(df, aes(x=PC_1, y=PC_2, color=cluster)) +
geom_point() +
labs(title = "Kmeans on PCA", subtitle = "Clustered by infection",
caption = paste("K =", k, " | PCS used =", nr_PCs, " | ARI =", round(max_ARI, 4),collapse=' '))
# for more indepth comments see ealier chunk "hierarchical"
k = 2
max_ARI = -1
nr_PCs = 0
for (i in c(2,3,4,6,10,15,20,30,50)){
for (m in c("complete", "single", "average", "centroid")){
dm <- dist(as.data.frame(PCA[,1:i])) # Distance matrix
hc <- hclust(dm, method = m) # simple dendrogram
clusterCutS <- cutree(hc, k) # data, number of clusters
ARI = adj.rand.index(clusterCutS, Pilot$obs$Is_infected)
if (max_ARI< ARI) {
best_linkage = m
max_ARI = ARI
best_hc = hc
nr_PCs = i
df$cluster <- factor(clusterCutS)
}
}
}
ggplot(df, aes(x=PC_1, y=PC_2, color=cluster)) +
geom_point() +
labs(title = "Hierarchical on PCA", subtitle = "Clustered by infection",
caption = paste("K =", k, " | PCS used =", nr_PCs,
" | ARI =", round(max_ARI, 4), " | linkage =", m,collapse=' '))
Not very good at identifying if cells are infected. This makes sense however since most of the differences between cells are probably due to difference in celltype.
Repeat of Celltype and Infection status
ggplot(df, aes(x = PC_1, y = PC_2, color = PatientID)) +
geom_point(alpha = 0.5) +
ggtitle("PCA colored by Patient") +
xlab("Principal Component 1") +
ylab("Principal Component 2")
# for more indepth comments see ealier chunk "k-means"
k = length(unique(Pilot$obs$PatientID)) # nr of patients as k
max_ARI = -1
nr_PCs = 0
for (i in 2:50){
cluster <- kmeans(PCA[,1:i], iter = 10000, nstart = 100, k, algorithm="MacQueen")
ARI = adj.rand.index(cluster$cluster, Pilot$obs$PatientID)
if (ARI > max_ARI){
max_ARI <- ARI
nr_PCs <- i
df$cluster <- factor(cluster$cluster)
}
}
ggplot(df, aes(x=PC_1, y=PC_2, color=cluster)) +
geom_point() +
labs(title = "Kmeans on PCA", subtitle = "Clustered by patient",
caption = paste("K =", k, " | PCS used =", nr_PCs, " | ARI =", round(max_ARI, 4),collapse=' '))
# for more indepth comments see ealier chunk "hierarchical"
k = length(unique(Pilot$obs$PatientID))
max_ARI = -1
nr_PCs = 0
for (i in c(2,3,4,6,10,15,20,30,50)){
for (m in c("complete", "single", "average", "centroid")){
dm <- dist(as.data.frame(PCA[,1:i])) # Distance matrix
hc <- hclust(dm, method = m) # simple dendrogram
clusterCutS <- cutree(hc, k) # data, number of clusters
ARI = adj.rand.index(clusterCutS, Pilot$obs$PatientID)
if (max_ARI< ARI) {
best_linkage = m
max_ARI = ARI
best_hc = hc
nr_PCs = i
df$cluster <- factor(clusterCutS)
}
}
}
ggplot(df, aes(x=PC_1, y=PC_2, color=cluster)) +
geom_point() +
labs(title = "Hierarchical on PCA", subtitle = "Clustered by Patient",
caption = paste("K =", k, " | PCS used =", nr_PCs,
" | ARI =", round(max_ARI, 4), " | linkage =", m,collapse=' '))
ggplot(df, aes(x = PC_1, y = PC_2, color = sampleID)) +
geom_point(alpha = 0.5) +
ggtitle("PCA colored by sample") +
xlab("Principal Component 1") +
ylab("Principal Component 2")
# for more indepth comments see ealier chunk "k-means"
k = length(unique(Pilot$obs$sampleID))
max_ARI = -1
nr_PCs = 0
for (i in 2:50){
cluster <- kmeans(PCA[,1:i], iter = 10000, nstart = 100, k, algorithm="MacQueen")
ARI = adj.rand.index(cluster$cluster, Pilot$obs$sampleID)
if (ARI > max_ARI){
max_ARI <- ARI
nr_PCs <- i
df$cluster <- factor(cluster$cluster)
}
}
ggplot(df, aes(x=PC_1, y=PC_2, color=cluster)) +
geom_point() +
labs(title = "Kmeans on PCA", subtitle = "Clustered by sample",
caption = paste("K =", k, " | PCS used =", nr_PCs, " | ARI =", round(max_ARI, 4),collapse=' '))
# for more indepth comments see ealier chunk "hierarchical"
k = length(unique(Pilot$obs$sampleID))
max_ARI = -1
nr_PCs = 0
for (i in c(2,3,4,6,10,15,20,30,50)){
for (m in c("complete", "single", "average", "centroid")){
dm <- dist(as.data.frame(PCA[,1:i])) # Distance matrix
hc <- hclust(dm, method = m) # simple dendrogram
clusterCutS <- cutree(hc, k) # data, number of clusters
ARI = adj.rand.index(clusterCutS, Pilot$obs$sampleID)
if (max_ARI< ARI) {
best_linkage = m
max_ARI = ARI
best_hc = hc
nr_PCs = i
df$cluster <- factor(clusterCutS)
}
}
}
ggplot(df, aes(x=PC_1, y=PC_2, color=cluster)) +
geom_point() +
labs(title = "Hierarchical on PCA", subtitle = "Clustered by sample",
caption = paste("K =", k, " | PCS used =", nr_PCs,
" | ARI =", round(max_ARI, 4), " | linkage =", m,collapse=' '))
It may be more interesting to look at this for each celltype to see if infection is more identifiable when the difference between celltypes is out of the picture.
In this loop we run PCA on each cohort of celltypes:celltypes <- unique(Pilot$obs$cellType)
df <- as.data.frame(Pilot$obs) # we create a dataframe to hold the cell embeddings
for (i in 1:50) { # this way it's easier to use ggplot and to run subsequent clustering
col_name <- paste0("PC_", i)
df[[col_name]] <- 0}
genes <- c() # This is just a list for saving the 5 most responsible genes pr PC 1 and 2 pr Celltype
for (c in celltypes){ # running PCA for each celltype
c_subset = Pilot[which(Pilot$obs$cellType == c)] # subset consisting only of celtype c
PCA <- RunPCA(c_subset$X, assay = NULL, npcs = 50, rev.pca = F, weight.by.var = T, verbose = F) # PCA on subset
# saving the 5 most "important" genes for PC 1 and 2 pr celltype
gene_loadings_1 <- abs(as.array(PCA@cell.embeddings[,1]))
gene_loadings_2 <- abs(as.array(PCA@cell.embeddings[,2]))
largest_indices_1 <- order(gene_loadings_1, decreasing = T)[1:5]
largest_indices_2 <- order(gene_loadings_2, decreasing = T)[1:5]
genes <- append(genes, c(
gene_loadings_1[largest_indices_1],
gene_loadings_2[largest_indices_2]
))
# saving the PCs. following few lines are failed attemts at doing this a smart way:
#df <- df %>%
#mutate(across(starts_with("PC"), ~ ifelse(cellType == c, PCA[, as.numeric(gsub("PC_", "", cur_column()))], .)))
#for (i in 1:50) {
#col_name <- paste0("PC_", i)
#df[df$cellType == c,][[col_name]] <- PCA[, i]}
# Ended up "manually" putting it in since i've used wayyyy to much time trying to do it a "smart" way
df[df$cellType == c,]$PC_1 <- PCA[, 1]
df[df$cellType == c,]$PC_2 <- PCA[, 2]
df[df$cellType == c,]$PC_3 <- PCA[, 3]
df[df$cellType == c,]$PC_4 <- PCA[, 4]
df[df$cellType == c,]$PC_5 <- PCA[, 5]
df[df$cellType == c,]$PC_6 <- PCA[, 6]
df[df$cellType == c,]$PC_7 <- PCA[, 7]
df[df$cellType == c,]$PC_8 <- PCA[, 8]
df[df$cellType == c,]$PC_9 <- PCA[, 9]
df[df$cellType == c,]$PC_10 <- PCA[, 10]
df[df$cellType == c,]$PC_11 <- PCA[, 11]
df[df$cellType == c,]$PC_12 <- PCA[, 12]
df[df$cellType == c,]$PC_13 <- PCA[, 13]
df[df$cellType == c,]$PC_14 <- PCA[, 14]
df[df$cellType == c,]$PC_15 <- PCA[, 15]
df[df$cellType == c,]$PC_16 <- PCA[, 16]
df[df$cellType == c,]$PC_17 <- PCA[, 17]
df[df$cellType == c,]$PC_18 <- PCA[, 18]
df[df$cellType == c,]$PC_19 <- PCA[, 19]
df[df$cellType == c,]$PC_20 <- PCA[, 20]
df[df$cellType == c,]$PC_21 <- PCA[, 21]
df[df$cellType == c,]$PC_22 <- PCA[, 22]
df[df$cellType == c,]$PC_23 <- PCA[, 23]
df[df$cellType == c,]$PC_24 <- PCA[, 24]
df[df$cellType == c,]$PC_25 <- PCA[, 25]
df[df$cellType == c,]$PC_26 <- PCA[, 26]
df[df$cellType == c,]$PC_27 <- PCA[, 27]
df[df$cellType == c,]$PC_28 <- PCA[, 28]
df[df$cellType == c,]$PC_29 <- PCA[, 29]
df[df$cellType == c,]$PC_30 <- PCA[, 30]
df[df$cellType == c,]$PC_31 <- PCA[, 31]
df[df$cellType == c,]$PC_32 <- PCA[, 32]
df[df$cellType == c,]$PC_33 <- PCA[, 33]
df[df$cellType == c,]$PC_34 <- PCA[, 34]
df[df$cellType == c,]$PC_35 <- PCA[, 35]
df[df$cellType == c,]$PC_36 <- PCA[, 36]
df[df$cellType == c,]$PC_37 <- PCA[, 37]
df[df$cellType == c,]$PC_38 <- PCA[, 38]
df[df$cellType == c,]$PC_39 <- PCA[, 39]
df[df$cellType == c,]$PC_40 <- PCA[, 40]
df[df$cellType == c,]$PC_41 <- PCA[, 41]
df[df$cellType == c,]$PC_42 <- PCA[, 42]
df[df$cellType == c,]$PC_43 <- PCA[, 43]
df[df$cellType == c,]$PC_44 <- PCA[, 44]
df[df$cellType == c,]$PC_45 <- PCA[, 45]
df[df$cellType == c,]$PC_46 <- PCA[, 46]
df[df$cellType == c,]$PC_47 <- PCA[, 47]
df[df$cellType == c,]$PC_48 <- PCA[, 48]
df[df$cellType == c,]$PC_49 <- PCA[, 49]
df[df$cellType == c,]$PC_50 <- PCA[, 50]
}
a <- -4
b <- 0
for (c in celltypes){
a <- a+5
b <- b+5
print(paste(c, "PC_1:", sep = " "))
print(genes[a:b])
a <- a+5
b <- b+5
print(paste(c, "PC_2:", sep = " "))
print(genes[a:b])
}
<!--- [1] "Macrophage PC_1:"
<!--- FTL FTH1 B2M MALAT1 TMSB10
<!--- 254.1367 221.4540 215.8587 209.7457 194.6836
<!--- [1] "Macrophage PC_2:"
<!--- S100A8 ISG15 MALAT1 CCL3L1 SAT1
<!--- 46.96770 46.79103 41.25764 37.67953 36.68414
<!--- [1] "Plasma PC_1:"
<!--- MT-CO2 MT-CO1 B2M MALAT1 RPLP1
<!--- 99.34336 95.17123 92.16644 90.54113 84.99634
<!--- [1] "Plasma PC_2:"
<!--- MTRNR2L12 S100A6 MT-CO1 MT-ND4 FTL
<!--- 40.41325 35.20989 32.81122 32.45974 30.71408
<!--- [1] "Secretory PC_1:"
<!--- MT-CO2 S100A6 MALAT1 SLPI MT-CO1
<!--- 158.3304 153.8221 152.7251 147.1263 143.4316
<!--- [1] "Secretory PC_2:"
<!--- FTH1 MT2A KRT17 IGFBP3 F3
<!--- 28.99001 25.98904 22.87755 20.41883 20.39639
<!--- [1] "Neutrophil PC_1:"
<!--- MALAT1 FTH1 B2M FTL SAT1
<!--- 150.4702 143.0461 135.3511 131.3739 128.1927
<!--- [1] "Neutrophil PC_2:"
<!--- S100A10 CCL2 CTSL RPL39 IFI27
<!--- 29.81841 28.05888 26.63028 26.41258 25.96694
<!--- [1] "NK PC_1:"
<!--- MALAT1 MT-CO2 B2M MT-CO1 MTRNR2L12
<!--- 84.21303 81.04485 78.29822 72.21772 67.32077
<!--- [1] "NK PC_2:"
<!--- ATP5F1A NEAT1 PPARA MT-ND4 PARD6B
<!--- 19.53952 19.22075 18.31997 17.73392 17.25745
<!--- [1] "T PC_1:"
<!--- MALAT1 B2M MT-CO2 RPL41 MT-CO1
<!--- 77.28206 70.22100 68.11936 64.38375 64.21668
<!--- [1] "T PC_2:"
<!--- NEAT1 RPL30 RPS27 MT-ND4 RPS8
<!--- 15.29269 12.76018 12.36423 12.01363 11.97450
<!--- [1] "Ciliated PC_1:"
<!--- MT-CO2 MT-ND3 B2M MALAT1 S100A6
<!--- 89.99186 81.66857 81.33299 81.03472 80.42279
<!--- [1] "Ciliated PC_2:"
<!--- MTRNR2L12 MTRNR2L8 FTH1 FTL AD000090.1
<!--- 26.91702 25.21887 22.37054 21.70132 19.66793
<!--- [1] "Squamous PC_1:"
<!--- MALAT1 MT-CO2 B2M MT-CO1 ACTB
<!--- 58.94801 49.35961 46.72385 46.59653 45.18565
<!--- [1] "Squamous PC_2:"
<!--- SPRR3 S100A9 S100A8 MAL ECM1
<!--- 33.41522 33.07228 27.17886 26.44136 25.91169
We now plot the infection status with each celltype as a facet of the plot:
df %>%
arrange(viralLoad) %>%
ggplot(aes(x = PC_1, y = PC_2 , color = Is_infected)) +
geom_point(size = 1.5, alpha = 0.6) +
facet_wrap("cellType", scales = "free", labeller = labeller()) +
scale_color_manual(values = c("0" = "blue", "1" = "red")) +
theme(axis.text = element_blank(),
axis.ticks = element_blank()) +
labs(title = "PCA cell type") +
xlab("PC_1") +
ylab("PC_2")
We’ll now try to cluster just like we did previously. This time however we do it for each celltype. First using K-means:
df["cluster"] <- 0 # column indicating what cluster a cell belongs to based on whaterver clustering
k = 2 # infected or not as k ie k=2
for (c in celltypes){ # for every celltype
# initialize some values
max_ARI = -1
nr_PCs = 0
c_subset = df[df$cellType == c,] # subsetting all celltypes c
for (i in 2:50){ # for i amount of included PCs to cluster on
cluster <- kmeans(c_subset[, 9:8+i], iter = 10000, nstart = 100, k, algorithm="MacQueen") # kmeans
ARI = adj.rand.index(cluster$cluster, c_subset$Is_infected) # ARI score
if (ARI > max_ARI){ # if we find a new highscore we update
max_ARI <- ARI
nr_PCs <- i
df[df$cellType == c,]$cluster <- factor(cluster$cluster)
}
}
# For every celltype we will note the best ARI and the number of PCs used:
assign(paste("ARI", c, sep = "_"), max_ARI)
assign(paste("nr_PCs", c, sep = "_"), nr_PCs)
}
# We then create a dto display the afforementioed details
f_labels <- data.frame(
celltypes = c("Ciliated", "Macrophage", "NK", "Neutrophil", "Plasma", "Secretory", "Squamous", "T"),
label = c(
paste("PCS used =", nr_PCs_Ciliated, " | ARI =", round(ARI_Ciliated, 4),collapse=' '),
paste("PCS used =", nr_PCs_Macrophage, " | ARI =", round(ARI_Macrophage, 4),collapse=' '),
paste("PCS used =", nr_PCs_NK, " | ARI =", round(ARI_NK, 4),collapse=' '),
paste("PCS used =", nr_PCs_Neutrophil, " | ARI =", round(ARI_Neutrophil, 4),collapse=' '),
paste("PCS used =", nr_PCs_Plasma, " | ARI =", round(ARI_Plasma, 4),collapse=' '),
paste("PCS used =", nr_PCs_Secretory, " | ARI =", round(ARI_Secretory, 4),collapse=' '),
paste("PCS used =", nr_PCs_Squamous, " | ARI =", round(ARI_Squamous, 4),collapse=' '),
paste("PCS used =", nr_PCs_T, " | ARI =", round(ARI_T, 4),collapse=' ')))
# Finally we plot the best clustering results for each celltype
ggplot(df, aes(x = PC_1, y = PC_2 , color = factor(cluster))) +
geom_point(size = 1.5, alpha = 0.6) +
facet_wrap("cellType", scales = "free", labeller = labeller()) +
theme(axis.text = element_blank(),
axis.ticks = element_blank()) +
labs(title = "Kmeans on PCA pr cell type", subtitle = "k = 2") +
xlab("PC_1") +
ylab("PC_2")
# With the afforementioed details:
f_labels
## celltypes label
## 1 Ciliated PCS used = 8 | ARI = 0.0955
## 2 Macrophage PCS used = 2 | ARI = 0.1096
## 3 NK PCS used = 2 | ARI = 0.2639
## 4 Neutrophil PCS used = 3 | ARI = 0.1383
## 5 Plasma PCS used = 2 | ARI = 0.5182
## 6 Secretory PCS used = 2 | ARI = 0.2311
## 7 Squamous PCS used = 5 | ARI = 0.0487
## 8 T PCS used = 8 | ARI = 0.1279
Then for hierarchical:
# same as the above but with a few differences
df["cluster"] <- 0 # reseatting cluster coloumn
for (c in celltypes){
c_subset = df[df$cellType == c,]
max_ARI = -1
nr_PCs = 0
for (i in c(2,3,4,6,10,15,20,30,50)){
o <- c_subset[, 9:(8+i)]
dm <- dist(o) # Distance matrix
hc <- hclust(dm, method = "centroid") # simple dendrogram.
#(since all of the above ended up using "centroid" we'll omit the other options)
clusterCutS <- cutree(hc, k=2) # clustering results (data, number of clusters)
ARI = adj.rand.index(clusterCutS, c_subset$PatientID) #ARI score
if (max_ARI< ARI) {
max_ARI = ARI
best_hc = hc
nr_PCs = i
df[df$cellType == c,]$cluster <- factor(clusterCutS)
}
}
assign(paste("ARI", c, sep = "_"), max_ARI)
assign(paste("nr_PCs", c, sep = "_"), nr_PCs)
}
f_labels <- data.frame(
celltypes = c("Ciliated", "Macrophage", "NK", "Neutrophil", "Plasma", "Secretory", "Squamous", "T"),
label = c(
paste("PCS used =", nr_PCs_Ciliated, " | ARI =", round(ARI_Ciliated, 4),collapse=' '),
paste("PCS used =", nr_PCs_Macrophage, " | ARI =", round(ARI_Macrophage, 4),collapse=' '),
paste("PCS used =", nr_PCs_NK, " | ARI =", round(ARI_NK, 4),collapse=' '),
paste("PCS used =", nr_PCs_Neutrophil, " | ARI =", round(ARI_Neutrophil, 4),collapse=' '),
paste("PCS used =", nr_PCs_Plasma, " | ARI =", round(ARI_Plasma, 4),collapse=' '),
paste("PCS used =", nr_PCs_Secretory, " | ARI =", round(ARI_Secretory, 4),collapse=' '),
paste("PCS used =", nr_PCs_Squamous, " | ARI =", round(ARI_Squamous, 4),collapse=' '),
paste("PCS used =", nr_PCs_T, " | ARI =", round(ARI_T, 4),collapse=' ')))
# we plot the best ARI
ggplot(df, aes(x = PC_1, y = PC_2 , color = factor(cluster))) +
geom_point(size = 1.5, alpha = 0.6) +
facet_wrap("cellType", scales = "free", labeller = labeller()) +
theme(axis.text = element_blank(),
axis.ticks = element_blank()) +
labs(title = "Hierarchical on PCA", subtitle = "Clustered by infection; k = 2") +
xlab("PC_1") +
ylab("PC_2")
f_labels
## celltypes label
## 1 Ciliated PCS used = 2 | ARI = 0.3187
## 2 Macrophage PCS used = 3 | ARI = 0.0019
## 3 NK PCS used = 2 | ARI = 0.3159
## 4 Neutrophil PCS used = 3 | ARI = 0.003
## 5 Plasma PCS used = 2 | ARI = 0.4308
## 6 Secretory PCS used = 2 | ARI = 0.2839
## 7 Squamous PCS used = 2 | ARI = 0.44
## 8 T PCS used = 2 | ARI = 0.1033
df %>%
arrange(viralLoad) %>%
ggplot(aes(x = PC_1, y = PC_2 , color = PatientID)) +
geom_point(size = 1.5, alpha = 0.6) +
facet_wrap("cellType", scales = "free", labeller = labeller()) +
theme(axis.text = element_blank(),
axis.ticks = element_blank()) +
labs(title = "PCA cell type") +
xlab("PC_1") +
ylab("PC_2")
# for more indeapth comments see "kmeans per celltype"
df["cluster"] <- 0
for (c in celltypes){
max_ARI = -1
nr_PCs = 0
c_subset = df[df$cellType == c,]
k = length(unique(c_subset$PatientID))
for (i in 2:50){
cluster <- kmeans(c_subset[, 9:8+i], iter = 10000, nstart = 100, k, algorithm="MacQueen")
ARI = adj.rand.index(cluster$cluster, c_subset$Is_infected)
if (ARI > max_ARI){
max_ARI <- ARI
nr_PCs <- i
df[df$cellType == c,]$cluster <- factor(cluster$cluster)
}
}
assign(paste("ARI", c, sep = "_"), max_ARI)
assign(paste("nr_PCs", c, sep = "_"), nr_PCs)
assign(paste("k", c, sep = "_"), k)
}
# Plot
f_labels <- data.frame(
celltypes = c("Ciliated", "Macrophage", "NK", "Neutrophil", "Plasma", "Secretory", "Squamous", "T"),
label = c(
paste("k =", k_Ciliated," | PCS used =", nr_PCs_Ciliated, " | ARI =", signif(ARI_Ciliated, 4),collapse=' '),
paste("k =", k_Macrophage," | PCS used =", nr_PCs_Macrophage, " | ARI =", signif(ARI_Macrophage, 4),collapse=' '),
paste("k =", k_NK," | PCS used =", nr_PCs_NK, " | ARI =", signif(ARI_NK, 4),collapse=' '),
paste("k =", k_Neutrophil, " | PCS used =", nr_PCs_Neutrophil, " | ARI =", signif(ARI_Neutrophil, 4),collapse=' '),
paste("k =", k_Plasma, " | PCS used =", nr_PCs_Plasma, " | ARI =", signif(ARI_Plasma, 4),collapse=' '),
paste("k =", k_Secretory, " | PCS used =", nr_PCs_Secretory, " | ARI =", signif(ARI_Secretory, 4),collapse=' '),
paste("k =", k_Squamous, " | PCS used =", nr_PCs_Squamous, " | ARI =", signif(ARI_Squamous, 4),collapse=' '),
paste("k =", k_T, " | PCS used =", nr_PCs_T, " | ARI =", signif(ARI_T, 4),collapse=' ')))
ggplot(df, aes(x = PC_1, y = PC_2 , color = factor(cluster))) +
geom_point(size = 1.5, alpha = 0.6) +
facet_wrap("cellType", scales = "free", labeller = labeller()) +
theme(axis.text = element_blank(),
axis.ticks = element_blank()) +
labs(title = "K-means on PCA per cell type", subtitle = "Clustered by patient") +
xlab("PC_1") +
ylab("PC_2")
f_labels
## celltypes label
## 1 Ciliated k = 7 | PCS used = 2 | ARI = 0.1191
## 2 Macrophage k = 8 | PCS used = 3 | ARI = 0.04279
## 3 NK k = 8 | PCS used = 2 | ARI = 0.3868
## 4 Neutrophil k = 8 | PCS used = 2 | ARI = 0.1
## 5 Plasma k = 8 | PCS used = 26 | ARI = 0.2679
## 6 Secretory k = 8 | PCS used = 2 | ARI = 0.08749
## 7 Squamous k = 8 | PCS used = 2 | ARI = 0.1357
## 8 T k = 7 | PCS used = 44 | ARI = 0.207
# for more indeapth comments see chunk "hierarchical per celltype"
df["cluster"] <- 0
for (c in celltypes){
c_subset = df[df$cellType == c,]
max_ARI = -1
nr_PCs = 0
k = length(unique(c_subset$PatientID)) # some celltype does not contain cells from all patients
for (i in c(2,3,4,6,10,15,20,30,50)){
o <- c_subset[, 9:(8+i)]
dm <- dist(o) # Distance matrix
hc <- hclust(dm, method = "centroid")
clusterCutS <- cutree(hc, k)
ARI = adj.rand.index(clusterCutS, c_subset$PatientID)
if (max_ARI< ARI) {
max_ARI = ARI
best_hc = hc
nr_PCs = i
df[df$cellType == c,]$cluster <- factor(clusterCutS)
}
}
assign(paste("ARI", c, sep = "_"), max_ARI)
assign(paste("nr_PCs", c, sep = "_"), nr_PCs)
assign(paste("k", c, sep = "_"), k)
}
f_labels <- data.frame(
celltypes = c("Ciliated", "Macrophage", "NK", "Neutrophil", "Plasma", "Secretory", "Squamous", "T"),
label = c(
paste("k =", k_Ciliated," | PCS used =", nr_PCs_Ciliated, " | ARI =", round(ARI_Ciliated, 4),collapse=' '),
paste("k =", k_Macrophage," | PCS used =", nr_PCs_Macrophage, " | ARI =", round(ARI_Macrophage, 4),collapse=' '),
paste("k =", k_NK," | PCS used =", nr_PCs_NK, " | ARI =", round(ARI_NK, 4),collapse=' '),
paste("k =", k_Neutrophil, " | PCS used =", nr_PCs_Neutrophil, " | ARI =", round(ARI_Neutrophil, 4),collapse=' '),
paste("k =", k_Plasma, " | PCS used =", nr_PCs_Plasma, " | ARI =", round(ARI_Plasma, 4),collapse=' '),
paste("k =", k_Secretory, " | PCS used =", nr_PCs_Secretory, " | ARI =", round(ARI_Secretory, 4),collapse=' '),
paste("k =", k_Squamous, " | PCS used =", nr_PCs_Squamous, " | ARI =", round(ARI_Squamous, 4),collapse=' '),
paste("k =", k_T, " | PCS used =", nr_PCs_T, " | ARI =", round(ARI_T, 4),collapse=' ')))
ggplot(df, aes(x = PC_1, y = PC_2 , color = factor(cluster))) +
geom_point(size = 1.5, alpha = 0.6) +
facet_wrap("cellType", scales = "free", labeller = labeller()) +
theme(axis.text = element_blank(),
axis.ticks = element_blank()) +
labs(title = "Hierarchical on PCA per cell type", subtitle = "Clustered by patient") +
xlab("PC_1") +
ylab("PC_2")
f_labels
## celltypes label
## 1 Ciliated k = 7 | PCS used = 3 | ARI = 0.4372
## 2 Macrophage k = 8 | PCS used = 3 | ARI = 0.1901
## 3 NK k = 8 | PCS used = 3 | ARI = 0.5139
## 4 Neutrophil k = 8 | PCS used = 3 | ARI = 0.5518
## 5 Plasma k = 8 | PCS used = 4 | ARI = 0.3178
## 6 Secretory k = 8 | PCS used = 2 | ARI = 0.1457
## 7 Squamous k = 8 | PCS used = 2 | ARI = 0.5997
## 8 T k = 7 | PCS used = 3 | ARI = 0.2506
And that’s the essence of our data-exploration with PCA. For similar documents using other techniques see Clean_t-SNE, -UMAP, and -scVi. For information abut the data see Clean_summary_statistics or Clean_data_subsetting. For some more clustering see Clean_clustering.